Optimizing the atom types of proteins through iterative knowledge-based potentials
Wang Xin-Xiang, Huang Sheng-You
School of Physics, Huazhong University of Science and Technology, Wuhan 430074, China

 

† Corresponding author. E-mail: huangsy@hust.edu.cn

Abstract

Knowledge-based scoring functions have been widely used for protein structure prediction and for protein–small molecule and protein–nucleic acid interactions, in which one critical step is to find an appropriate representation of protein structures. A key issue is to determine the minimal protein representation, which is important not only for developing scoring functions but also for understanding the physics of protein folding. Despite significant progress in simplifying residues into reduced alphabets, few studies have addressed the optimal number of atom types for proteins. Here, we have investigated the atom typing issue by classifying the 167 heavy atoms of proteins through 11 schemes with 1 to 20 atom types based on their physicochemical and functional environments. For each atom typing scheme, a statistical mechanics-based iterative method was used to extract atomic distance-dependent potentials from protein structures. The atomic distance-dependent pair potentials for different schemes were illustrated by several typical atom pairs with different physicochemical properties. The derived potentials were also evaluated on a high-resolution test set of 148 diverse proteins for native structure recognition. It was found that there was a crossover around the scheme of four atom types in terms of the success rate as a function of the number of atom types, which means that four atom types may be used when investigating the basic folding mechanism of proteins. However, a close examination of typical potentials revealed that 14 atom types were needed to describe the protein interactions at the atomic level. The present study will be beneficial for the development of protein-related scoring functions and the understanding of folding mechanisms.

1. Introduction

Appropriate representation of protein structures is important in computational structural biology. It is associated not only with the development of scoring functions but also with the understanding of protein folding mechanisms. Coarse-grained modeling is one way to represent protein structures and is often applied to reduce computational cost. A key factor in such a model is to determine the minimal protein representation, residue-based or atom-based, without sacrificing essential precision. Simplifying the twenty amino acids into a smaller representative alphabet has been widely studied in protein folding,[1–5] molecular docking,[6] protein structure prediction,[7,8] protein design,[9] protein function,[10] protein classification,[11–13] and protein sequence alignment.[14,15] Baker et al. designed a 57-residue protein, Src SH3, with a reduced alphabet using five amino acid types.[5] Motivated by their work, Wang and Wang obtained an optimal reduction with five types of residues that has the same form as the simplified palette of Baker and coworkers, based on the concept of mismatch between a reduced interaction matrix and the Miyazawa and Jernigan (MJ) matrix.[1] In coarse-grained molecular simulations, lattice chain models with two or more residue types (HP, HNP, IAGEK, etc.) have provided a good understanding of the folding process of natural proteins.[4,9,16–18] Although residue-based typing issues in protein modeling, folding, and design have been studied by researchers,[1,3,19–23] the optimal atom classification for proteins has received little attention.
In fact, when scoring protein structure predictions and protein–protein, protein–ligand, and protein–nucleic acid interactions, atom-based potentials are often required for an accurate evaluation.[24–28] Over the years, various protein atom typing schemes have been developed to characterize the interactions between atoms.[29–34] However, the optimization of atom typing remains an important issue. Current approaches normally classify protein atoms based on their physical and chemical properties. These properties include atomic number, charge, polarity, hydrophobicity, hydrogen bonding, local chemical environment, and protein secondary structure environment.[35–39]

Despite the successes of current atom typing schemes on some systems, their classification methods for atom types are somewhat arbitrary and depend on the specific systems studied. Given the 167 heavy atoms of the 20 amino acids, protein atoms could be grouped into 1 to 167 types.[36,37] If a scheme has a small number of atom types, the potentials of the scoring function will be simple and fast to evaluate, but their resolution in characterizing the atomic interactions will be sacrificed. Conversely, if a scheme has a large number of atom types, the corresponding potentials will have a better resolution in describing the interactions, but they may be slower to evaluate because of the larger number of possible interacting pairs, and may contain more errors because of insufficient statistics in the derivation of the scoring functions.[32–35] Therefore, an appropriate typing scheme is needed to achieve a good balance between accuracy and resolution, especially for knowledge-based scoring functions.[27]

In this work, we have used 11 schemes to categorize protein atoms based on their physical and chemical environments in proteins. The 11 schemes give 1 to 20 atom types, depending on the level of detail with which the atoms’ environments are taken into account. Using a statistical mechanics-based iterative method[35] and a training set of 1225 proteins,[40] we have derived 11 sets of knowledge-based pair potentials based on the 11 protein atom typing schemes. The quality of the different atom typing schemes was assessed by the ability of the resulting scoring functions to discriminate native structures from decoys on a test set of 148 proteins with 500–1600 high-resolution (HR) decoys for each protein.[40] We have also examined the derived knowledge-based potentials of some typical atom pairs to obtain a relation between the accuracy of the potentials and the dimension of the parameters. The present work provides a reference on the optimal number of atom types for proteins in the development of scoring functions and will be useful for the study of protein design and structure prediction.

2. Materials and methods
2.1. Eleven protein atom typing schemes

Eleven schemes have been used to classify the 167 heavy atoms of proteins based on their physical, chemical, and functional environments. The details of the atom typing schemes are illustrated in Fig. 1. The number of atom types in a scheme increases as more details are considered. In the present study, the simplest atom typing scheme groups all the heavy atoms into one type. The most detailed scheme has twenty protein atom types (Table 1 and Fig. 1); it was used in our previous study and was shown to be efficient in discriminating native structures from decoys.[35] Starting from the twenty protein atom types, “similar” atom types are gradually combined according to their properties, so that the dimension of the atom typing scheme becomes smaller and smaller until all the atom types are combined into one type.[27,35] The 11 schemes result in 1, 2, 4, 6, 8, 10, 12, 14, 16, 18, and 20 atom types, respectively. During the classification of the protein atoms, one important aspect is to consider the effects of physical and chemical environments such as van der Waals interactions, electrostatics, hydrophobicity, entropic effects, and hydrogen bonding, but the ultimate goal of atom typing is to develop an accurate scoring function based on the atom typing scheme.[25,26,35,41–43] Therefore, we have derived the corresponding scoring functions of pair potentials for the different atom typing schemes through a statistical mechanics-based iterative method.[35]

Fig. 1. (color online) Eleven typing schemes of protein heavy atoms based on their physical and chemical environments. The numbers on the left stand for the numbers of grouped atom types. The definitions of the final 20 atom types are listed in Table 1.
Table 1. List of the 20 atom types used in the iterative method for the heavy atoms in 20 standard amino acids. The symbol “*” stands for any residue.
2.2. The iterative method to extract effective potentials

During our computations, a large training set of experimentally determined protein structures and computationally generated decoys was used to derive the scoring function of pair potentials.[40] Through a statistical mechanics-based iterative method, all the native structures are expected to have the lowest energy scores compared with their respective decoys. The basic idea of the iteration process is described by the following expressions:[35]

$$u_{ij}^{(k+1)}(r) = u_{ij}^{(k)}(r) + \Delta u_{ij}^{(k)}(r), \qquad \Delta u_{ij}^{(k)}(r) = \lambda k_{\rm B}T\left[g_{ij}^{(k)}(r) - g_{ij}^{\rm obs}(r)\right],$$

where $i$ and $j$ represent the types of a protein atom pair, $u_{ij}^{(k)}(r)$ is the statistical interaction potential of atom pair $ij$ at a distance of $r$ in the $k$-th iterative step, $u_{ij}^{(k+1)}(r)$ is the improved potential through iteration in the $(k+1)$-th step, $\lambda$ represents a convergence parameter, $k_{\rm B}$ is the Boltzmann constant, and $T$ is the absolute temperature. Without loss of generality, $k_{\rm B}T$ is set to unity in the present study. $g_{ij}^{\rm obs}(r)$ is the pair distribution function for the experimentally observed (i.e., native) structures. $g_{ij}^{(k)}(r)$ is the weighted average of the predicted pair distribution functions for the $k$-th step, computed over a training set of experimentally determined native protein structures and computationally generated decoys according to the Boltzmann probability distribution. To reduce the local correlation effects of the covalent bonds and obtain an unbiased scoring function, only the interatomic interactions between non-neighboring residues were considered in the calculation of pair distribution functions and interaction potentials. During the iteration, the standard potentials of mean force (PMF)[35,44] were used as the initial potentials,

$$u_{ij}^{(0)}(r) = -k_{\rm B}T \ln g_{ij}^{\rm obs}(r).$$

The statistical energy score of each native structure or non-native/decoy structure at the $k$-th iterative step can be calculated as follows:

$$U^{(k)}(S_{pq}) = \sum_{i=1}^{N}\sum_{j=i}^{N}\int_{0}^{r_{\rm cut}} u_{ij}^{(k)}(r)\,\rho_{ij}^{pq}(r)\,4\pi r^{2}\,{\rm d}r,$$

where $P$ stands for the number of proteins in the training set, $D_p$ stands for the number of decoys for the $p$-th protein, and $S_{pq}$ stands for the $q$-th decoy of the $p$-th protein, with $p = 1, \ldots, P$, $q = 0, \ldots, D_p$, and the native structures correspond to $q = 0$. $\rho_{ij}^{pq}(r)$ is the number density of atom pair $ij$ at distance $r$ for the structure $S_{pq}$. As the size of each protein structure is finite, a distance cutoff $r_{\rm cut}$ is applied when counting atom pairs, which was set to 12 Å in this work. $N$ is the number of protein atom types for a typing scheme (Fig. 1). The iteration is repeated until a maximum number of native structures are discriminated from decoys.
Thus, at the end of the iteration, we can obtain a set of knowledge-based pair potentials for each atom typing scheme. The details on the pair distribution functions and the statistical mechanics theory of the iteration method have been described in our previous study.[35]
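
The iteration described in this section can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes a single atom-pair type, represents each structure by a small histogram of pair counts over distance bins, treats the normalized histogram as a stand-in for the pair distribution function, and sets k_B T = 1. The function names, the value `lam=0.5`, the fixed number of steps, and the toy data are all assumptions made for illustration.

```python
import math

def energy(u, counts):
    """Score of one structure: sum over distance bins of potential x pair count."""
    return sum(ub * nb for ub, nb in zip(u, counts))

def normalize(counts):
    """Turn raw pair counts into a normalized histogram (stand-in for g(r))."""
    total = sum(counts)
    return [c / total for c in counts]

def observed_g(proteins):
    """Average distribution over the native structures (index 0 of each protein)."""
    g = [0.0] * len(proteins[0][0])
    for structures in proteins:
        for b, gb in enumerate(normalize(structures[0])):
            g[b] += gb
    return [gb / len(proteins) for gb in g]

def predicted_g(u, proteins):
    """Boltzmann-weighted average over native + decoys of each protein (k_B T = 1)."""
    g = [0.0] * len(u)
    for structures in proteins:
        energies = [energy(u, s) for s in structures]
        e_min = min(energies)  # shift energies for numerical stability
        weights = [math.exp(-(e - e_min)) for e in energies]
        z = sum(weights)
        for s, w in zip(structures, weights):
            for b, gb in enumerate(normalize(s)):
                g[b] += (w / z) * gb
    return [gb / len(proteins) for gb in g]

def iterate(proteins, lam=0.5, steps=200, eps=1e-6):
    """Start from a PMF-like potential and refine it iteratively."""
    g_obs = observed_g(proteins)
    u = [-math.log(max(gb, eps)) for gb in g_obs]  # initial potential: -ln g_obs
    for _ in range(steps):
        g_pred = predicted_g(u, proteins)
        # Raise the potential where decoys over-populate a bin, lower it otherwise.
        u = [ub + lam * (gp - go) for ub, gp, go in zip(u, g_pred, g_obs)]
    return u

def natives_ranked_first(u, proteins):
    """Number of proteins whose native structure has the lowest score."""
    return sum(1 for s in proteins
               if energy(u, s[0]) == min(energy(u, x) for x in s))

# Hypothetical toy data: 2 proteins, each with a native (first) and 2 decoys,
# described by pair counts in 3 distance bins.
proteins = [
    [[8, 1, 1], [1, 8, 1], [1, 1, 8]],
    [[6, 2, 2], [2, 6, 2], [2, 2, 6]],
]
u = iterate(proteins)
print(u, natives_ranked_first(u, proteins))
```

In a real application the stopping criterion would be the number of natives ranked first rather than a fixed step count, and the histograms would be built per atom-pair type from non-neighboring residues within the distance cutoff.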

The iterative method is robust and has been widely used in studying protein–protein interactions, protein–ligand interactions, protein–RNA/DNA interactions, and protein structure prediction.[32–35,42,45] The iterative method circumvents the long-standing reference state problem in the development of knowledge-based scoring functions by improving the pair potentials iteratively through comparison of the predicted and observed pair distribution functions. The scoring function ITScorePP, which was derived by a similar iterative method, has proven to be effective in past Critical Assessment of PRediction of Interactions (CAPRI) experiments.[46,47]

2.3. Training and test data sets

In the present work, we have used the training set of 1225 proteins and the test set of 148 proteins that were generated by Rajgaria et al.[40] All the proteins, which were selected by Zhang and Skolnick, are nonredundant single-domain proteins with a maximum pairwise sequence similarity of 35%.[48] The length of these proteins ranges from 41 to 200 amino acids. This set also has a uniform distribution of α, β, and α/β protein structures.

Out of the 1225 proteins, we have randomly chosen 500 proteins as the training set for our iterative computation. The iteration was run 10 times in the same way, and the final potentials were the averages over the resulting iterative potentials of the 10 runs. For the sake of computational efficiency, we have selected 400 decoys for each protein in the iteration. Our main purpose is to obtain the relationship between the effectiveness of the statistical interaction potentials and the dimension of the atom typing schemes, so we have focused on comparing the effectiveness of different atom typing schemes in protein structure prediction in order to obtain an optimal atom typing scheme. Therefore, as long as the statistics for atom pairs are sufficient, the number of proteins in the training set and the number of decoys for each protein are less relevant factors in our calculations. Given the interaction pair potentials for different atom typing schemes, we have closely examined the potential energy curves of several typical atom pairs with different physicochemical properties. All the derived scoring functions were also evaluated in terms of native structure recognition and five other parameters on the test set of 148 proteins with 500–1600 HR decoys for each protein.

With the set of 500 randomly selected nonhomologous proteins, our large training database yields significant frequency statistics for 209 of the 210 possible pairs in the case of 20 atom types. For example, the frequency is 1454923 for the C3C–C3C pair. The average frequencies are expected to be higher when fewer atom types are adopted. The high frequencies of most atom pairs occurring in the training set provide sufficient statistics to derive the distance-dependent pair potentials. Therefore, the ensemble is large enough in terms of atom pair frequencies for a statistical study.
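
The figure of 210 possible pairs for 20 atom types follows from counting unordered pairs of types, including pairs of a type with itself, i.e., N(N+1)/2. The short sketch below tabulates this count for the 11 schemes; the function and variable names are illustrative only.

```python
def n_pair_types(n_atom_types):
    """Number of unordered atom-type pairs, self-pairs included: N(N+1)/2."""
    return n_atom_types * (n_atom_types + 1) // 2

# Numbers of atom types in the 11 schemes considered in this work.
SCHEMES = [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

pair_counts = {n: n_pair_types(n) for n in SCHEMES}
for n, pairs in pair_counts.items():
    print(f"{n:2d} atom types -> {pairs:3d} pair potentials")
```

Because the number of pair potentials grows quadratically with the number of atom types, the observations available per pair shrink accordingly, which is why sufficient statistics matter most for the finest schemes.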

3. Results and discussion
3.1. The potentials for different atom typing schemes

Through a statistical mechanics-based iterative method, we have developed 11 sets of atomic distance-dependent pair potentials corresponding to the 11 atom typing schemes. To illustrate how the atom typing schemes impact the derived statistical interaction potentials and to obtain an overall physical picture of the atomic interactions, we have plotted the potential energy curves of several typical atom pairs with different physicochemical properties when the protein atoms were categorized into 1 to 20 atom types. Similar to our previous study,[35] we have chosen several representative atom pairs for demonstration in the present work. They represent electrostatic interactions, [O2–]–[N2+] (e.g., ASP_OD2–ARG_NH1) and [O2–]–[N3+] (e.g., ASP_OD2–LYS_NZ) (Figs. 2 and 3); hydrogen bonding interactions, [O2M]–[N2N] (e.g., *_O–*_N) and [O3H]–[O3H] (e.g., SER_OG–TYR_OH) (Figs. 4 and 5); and hydrophobic interactions, [Car]–[Car] (e.g., PHE_CZ–TRP_CH2) and [C3C]–[C3C] (e.g., ILE_CD1–LEU_CD2) (Figs. 6 and 7).

Fig. 2. (color online) Comparison of the derived pair potentials of [O2–]–[N2+] for different atom typing schemes with 1 to 20 atom types.
Fig. 3. (color online) Comparison of the derived pair potentials of [O2–]–[N3+] for different atom typing schemes with 1 to 20 atom types.
Fig. 4. (color online) Comparison of the derived pair potentials of [O2M]–[N2N] for different atom typing schemes with 1 to 20 atom types.
Fig. 5. (color online) The comparison of the derived pair potentials of [O3H]–[O3H] for different atom typing schemes with 1 to 20 atom types.
Fig. 6. (color online) The comparison of the derived pair potentials of [Car]–[Car] for different atom typing schemes with 1 to 20 atom types.
Fig. 7. (color online) The comparison of the derived pair potentials of [C3C]–[C3C] for different atom typing schemes with 1 to 20 atom types.

Several common trends can be observed from the pairs of knowledge-based potentials under different atom typing schemes. First, with the increasing number of classified atom types, the equilibrium positions of the potentials tend to move left. Second, the depths of the potential wells become deeper with more atom types classified. For clarity, the equilibrium positions and depths of the potential wells of all typical atom pairs are listed in Table 2, and illustrated in Figs. 8 and 9. It can be seen from the figures that the equilibrium positions and well depths become relatively stable at the alphabet of 14 atom types. Third, the pair potentials gradually converge into a stable curve, though the converging speed depends on the specific atom pairs. For electrostatic interactions, the derived potentials become stable when the protein atoms are grouped into eight or more atom types (Figs. 2 and 3). For hydrogen bond interactions, the pair potentials converge at the scheme of four atom types, and become consistent for the schemes of eight or more atom types (Figs. 4 and 5). For hydrophobic interactions, the potential curves start to merge together when 14 or more atom types are used (Figs. 6 and 7). Therefore, when considering all the atom pair potentials of different atom typing schemes, categorizing protein heavy atoms into 14 atom types seems to be the optimal choice in terms of atomic pair potentials.
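
As an illustration of how the two quantities compared above can be read off a tabulated potential, the sketch below takes the equilibrium position as the distance at the global minimum of the curve and the well depth as the potential value there. This definition is an assumption for illustration and may differ from the exact procedure used for Table 2; the curve values are hypothetical.

```python
def well_minimum(distances, potential):
    """Return (equilibrium position, well depth) of a tabulated pair potential.

    Assumes the equilibrium position is the distance of the global minimum
    and the well depth is the potential value at that point.
    """
    idx = min(range(len(potential)), key=potential.__getitem__)
    return distances[idx], potential[idx]

# Hypothetical potential curve sampled on a distance grid (in angstroms).
distances = [2.6, 2.8, 3.0, 3.2, 3.4]
potential = [0.5, -1.2, -0.8, -0.3, 0.0]

position, depth = well_minimum(distances, potential)
print(position, depth)
```

Applying such a function to the curves of one atom pair across all schemes gives the kind of series in which a leftward shift of the minimum and a deepening of the well can be tracked.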

Fig. 8. (color online) The equilibrium positions of six typical pair potentials under different atom typing schemes.
Fig. 9. (color online) The well depths of six typical pair potentials under different atom typing schemes.
Table 2. The equilibrium positions and well depths of several typical pair potentials for the atom typing schemes with 1–20 atom types.
3.2. Performances in discriminating native structures

Atom typing is a critical aspect in the development of knowledge-based scoring functions. An important criterion for the quality of an atom typing scheme is therefore the ability of the corresponding scoring function to discriminate native structures from decoys. Accordingly, we have evaluated the performances of the 11 scoring functions derived from the 11 atom typing schemes through the statistical mechanics-based iterative method on the high-resolution (HR) decoy set of 148 proteins generated by Rajgaria et al.[40] Table 3 lists the six assessment parameters for different atom typing schemes, among which the success rate of identifying native structures, the correlation between the energy scores and the RMSDs of decoys, and the Z-score of native structures are shown in Figs. 10–12, respectively. Here, the success rate is defined as the number of proteins whose native structure is ranked #1 divided by the total number of proteins in the test set. The Z-score is defined as

$$Z = \frac{\langle E_{\rm decoy}\rangle - E_{\rm native}}{\sigma},$$

where $\langle E_{\rm decoy}\rangle$ is the average energy score of all the decoys for a protein, $E_{\rm native}$ is the energy score of the native structure, and $\sigma$ is the standard deviation of the energy scores of all the decoys. The Z-score measures the relative energy separation between the native structure of a protein and its decoys.
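
The two measures defined above can be sketched as follows. The sign convention (a positive Z-score when the native scores below the decoy average) and the use of the population standard deviation are assumptions; the function names and the scores are hypothetical.

```python
import statistics

def z_score(native_score, decoy_scores):
    """(mean decoy score - native score) / standard deviation of decoy scores.

    Assumes the population standard deviation; positive when the native
    structure scores lower (better) than the decoy average.
    """
    mean = statistics.fmean(decoy_scores)
    sigma = statistics.pstdev(decoy_scores)
    return (mean - native_score) / sigma

def success_rate(score_sets):
    """Fraction of proteins whose native score is strictly below every decoy score."""
    hits = sum(1 for native, decoys in score_sets if native < min(decoys))
    return hits / len(score_sets)

# Hypothetical energy scores: (native score, list of decoy scores) per protein.
example = [
    (-10.0, [-4.0, -6.0, -8.0, -2.0]),  # native ranked first
    (-3.0, [-5.0, -1.0]),               # one decoy beats the native
]

print(z_score(*example[0]))   # separation of the first native from its decoys
print(success_rate(example))  # one of two natives ranked first
```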

Fig. 10. (color online) Success rates of the proteins with native structures as rank 1 for 11 atom typing schemes with 1 to 20 atom types.
Table 3. Testing results for 11 atom typing schemes on HR decoy sets.[40]

One common feature can be found in five of the assessment parameters (all except the Z-score) as a function of the number of atom types: their values change quickly at the beginning and then become relatively stable as the number of atom types increases (Table 3). There is a crossover between fast and slow changes at the scheme of four atom types (Figs. 10 and 11). The Z-score of native structures also singles out this scheme: it has a maximum at four atom types, before which it increases quickly and after which it drops again (Fig. 12). Therefore, the scheme of four atom types seems to give the best balance between accuracy and resolution in terms of the success rate of a scoring function in discriminating native structures. It can also be seen from Fig. 10 that the scheme of 14 atom types gives a slightly higher success rate than the other schemes. A similar trend was found in the work by Mintseris et al.:[28] the highest peak of effective mutual information (MI) occurs at an alphabet of size four, corresponding to hydrophobic, polar, positive, and negative types, and the effectiveness of all pair-wise potentials flattens out at an alphabet of size 12.[28]

Fig. 11. (color online) The average Pearson correlation coefficients (CC) between the energy scores and the RMSD of decoys for 11 atom typing schemes with 1 to 20 atom types.
Fig. 12. (color online) The average Z-score for native structures for 11 atom typing schemes with 1 to 20 atom types.
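
The average Pearson correlation coefficient reported in Fig. 11 can be computed along the following lines. This is a sketch only: it assumes the coefficient is evaluated per protein over its decoys and then averaged over proteins, and the function names and input values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_cc(per_protein):
    """Mean of the per-protein correlations between energy scores and RMSDs."""
    return sum(pearson(scores, rmsds) for scores, rmsds in per_protein) / len(per_protein)

# Hypothetical (energy scores, RMSDs) of the decoys of two proteins.
data = [
    ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),  # score rises with RMSD
    ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),  # score falls with RMSD
]
print(average_cc(data))
```

A scoring function that tracks structural quality well gives a correlation close to 1 for each protein, since decoys farther from the native should receive higher (worse) scores.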

From the above assessments over different atom typing schemes, we have found that the scheme of four atom types did the best in low dimensions while the scheme of 14 atom types obtained the best performance in high dimensions. The four atom types correspond to the natural atom typing scheme of C, N, O, and S. The scheme can give excellent average Z-score and rank, and the other parameters are not far from the maximum values in the high dimensions. Therefore, we may use the scheme of four atom types as rough screening (low precision) for protein structure prediction, as the simple scheme can greatly cut down the computational cost. When accurate prediction is needed, we can turn to the scheme of 14 atom types to evaluate structures in high resolution.
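
Since the four-type scheme corresponds to the elements C, N, O, and S, a fine-grained atom type can be collapsed to it by keeping only its element. The sketch below assumes that every fine-grained label begins with its one-letter element symbol, as the labels quoted in this paper do ([O2–], [N2+], [N3+], [O2M], [N2N], [O3H], [Car], [C3C]); the function names are illustrative.

```python
# Fine-grained atom type labels quoted in the text (ASCII '-' used for the charge sign).
FINE_TYPES = ["O2-", "N2+", "N3+", "O2M", "N2N", "O3H", "Car", "C3C"]

def to_element(fine_type):
    """Collapse a fine-grained type to the four-type scheme (C, N, O, S).

    Assumes the label starts with a one-letter element symbol.
    """
    return fine_type[0]

def coarse_pair(type_i, type_j):
    """Unordered element pair used to look up a four-type pair potential."""
    return tuple(sorted((to_element(type_i), to_element(type_j))))

for fine in FINE_TYPES:
    print(fine, "->", to_element(fine))
print(coarse_pair("O2-", "N2+"))
```

With such a mapping, the same structure can be scored first with the four-type potentials for rough screening and then with a finer scheme for the structures that pass.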

4. Conclusion

In this work, we have addressed the protein atom typing problem by categorizing protein heavy atoms into 11 atom typing schemes. The knowledge-based pair potentials for the 11 atom typing schemes were derived using a statistical mechanics-based iterative method. The performances of different atom typing schemes were evaluated by the comparison of the derived knowledge-based pair potentials and the ability of the corresponding scoring functions in discriminating native structures from decoys. It was found that the derived pair potentials started to converge when 14 or more atom types were used, while four types were enough to obtain a satisfactory success rate in native structure recognition. The results suggested that the number of atom types could range from 4 to 14 in practical applications depending on the studied systems. The scheme of four atom types (i.e., C, N, O, and S) could be used for assessing the overall quality of protein structures, while an accurate description of interactions for protein structures at atomic level may require a finer scheme of 14 atom types. The present study provides a basic guidance for the classification of protein atoms, and is expected to benefit the development of scoring functions and the understanding of interaction mechanisms in proteins.

References
[1] Wang J Wang W 1999 Nat. Struct. Mol. Biol. 6 1033
[2] Wang J Wang W 2000 Phys. Rev. E 61 6981
[3] Fan K Wang W 2003 J. Mol. Biol. 328 921
[4] Dill K A 1985 Biochemistry 24 1501
[5] Riddle D S Santiago J V Bray-Hall S T Doshi N Grantcharova V P Yi Q Baker D 1997 Nat. Struct. Biol. 4 805
[6] Launay G Mendez R Wodak S Simonson T 2007 BMC Bioinform. 8 270
[7] Luthra A Jha A N Ananthasuresh G K Vishveswara S 2007 J. Biosci. 32 883
[8] Li T Fan K Wang J Wang W 2003 Protein Eng. 16 323
[9] Walter K U Vamvaca K Hilvert D 2005 J. Biol. Chem. 280 37742
[10] Akanuma S Kigawa T Yokoyama S 2002 Proc. Natl. Acad. Sci. USA 99 13549
[11] Peterson E L Kondev J Theriot J A Phillips R 2009 Bioinformatics 25 1356
[12] Albayrak A Otu H H Sezerman U O 2010 BMC Bioinform. 11 428
[13] Etchebest C Benros C Bornot A Camproux A C de Brevern A G 2007 Eur. Biophys. J. 36 1059
[14] Melo F Marti-Renom M A 2006 Proteins 63 986
[15] Cannata N Toppo S Romualdi C Valle G 2002 Bioinformatics 18 1102
[16] Shi Y Z Wu Y Y Wang F H Tan Z J 2015 Chin. Phys. B 24 116802
[17] Zhang W Sun Z B Zou X W 2005 Chin. Phys. Lett. 22 2133
[18] Li W F Zhang J Wang J Wang W 2015 Acta Phys. Sin. 64 098701 (in Chinese) http://wulixb.iphy.ac.cn/EN/abstract/abstract64179.shtml#
[19] Solis A D 2015 Proteins 83 2198
[20] Huang J T Wang T Huang S R Li X 2015 Proteins 83 631
[21] Levitt M Warshel A 1975 Nature 253 694
[22] Wang J Wang W 2002 Phys. Rev. E 65 041911
[23] Chen H Zhou X Ou-Yang Z C 2002 Phys. Rev. E 65 061907
[24] Liu Y Zeng J Gong H 2014 Proteins 82 2383
[25] Yang Y Zhou Y 2008 Protein Sci. 17 1212
[26] Yang Y Zhou Y 2008 Proteins 72 793
[27] Mintseris J Weng Z 2004 Genome Informatics 15 160
[28] Mintseris J Pierce B Wiehe K Anderson R Chen R Weng Z 2007 Proteins 69 511
[29] Zhao Y Cheng T Wang R 2007 J. Chem. Inf. Model. 47 1379
[30] Jiang L Gao Y Mao F Liu Z Lai L 2002 Proteins 46 190
[31] Zhang M Chen C He Y Xiao Y 2005 Phys. Rev. E 72 051919
[32] Huang S Y Zou X 2014 Nucleic Acids Res. 42 e55
[33] Huang S Y Zou X 2008 Proteins 72 557
[34] Huang S Y Zou X 2006 J. Comput. Chem. 27 1866
[35] Huang S Y Zou X 2011 Proteins 79 2648
[36] Zhou H Zhou Y 2002 Protein Sci. 11 2714
[37] Zhou H Skolnick J 2011 Biophys J. 101 2043
[38] Mitchell J Alex A Snarey M 1999 J. Chem. Inf. Comput. Sci. 39 751
[39] Labute P 2005 J. Chem. Inf. Model. 45 215
[40] Rajgaria R McAllister S R Floudas C A 2008 Proteins 70 950
[41] Lu M Dousis A D Ma J 2008 J. Mol. Biol. 376 288
[42] Huang S Y Zou X 2010 J. Chem. Inf. Model. 50 263
[43] Kadukova M Grudinin S 2016 J. Chem. Inf. Model. 56 1410
[44] Xu W X Li Y Zhang J Z 2012 Chin. Phys. Lett. 29 068702
[45] Yan Y M Zhang D Zhou P Li B T Huang S Y 2017 Nucleic Acids Res. 45 W365
[46] Huang S Y 2014 Drug Discov Today 19 1081
[47] Lensink M F Velankar S Kryshtafovych A Huang S Y et al. 2016 Proteins 84 323
[48] Zhang Y Skolnick J 2004 Proc. Natl. Acad. Sci. USA 101 7594